Omitting the superscript t, we have the i-th component of ∂A/∂w as

\left(\frac{\partial A}{\partial w}\right)_i =
\begin{bmatrix}
0 & \cdots & \cdots & \cdots & 0 \\
\vdots & & \vdots & & \vdots \\
\frac{\partial A_{i,i}}{\partial w_{i,1}} & \cdots & \frac{\partial A_{i,i}}{\partial w_{i,j}} & \cdots & \frac{\partial A_{i,i}}{\partial w_{i,J}} \\
\vdots & & \vdots & & \vdots \\
0 & \cdots & \cdots & \cdots & 0
\end{bmatrix},
(3.126)
from which we can derive

w\hat{G}(w, A) =
\begin{bmatrix}
w_1\hat{g}_1 & \cdots & w_1\hat{g}_i & \cdots & w_1\hat{g}_I \\
\vdots & & \vdots & & \vdots \\
w_I\hat{g}_1 & \cdots & w_I\hat{g}_i & \cdots & w_I\hat{g}_I
\end{bmatrix}.
(3.127)
Combining Eq. 3.126 and Eq. 3.127, we get

w\hat{G}(w, A)\left(\frac{\partial A}{\partial w}\right)_i =
\begin{bmatrix}
w_1\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,1}} & \cdots & w_1\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,j}} & \cdots & w_1\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,J}} \\
\vdots & & \vdots & & \vdots \\
w_i\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,1}} & \cdots & w_i\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,j}} & \cdots & w_i\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,J}} \\
\vdots & & \vdots & & \vdots \\
w_I\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,1}} & \cdots & w_I\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,j}} & \cdots & w_I\hat{g}_i\frac{\partial A_{i,i}}{\partial w_{i,J}}
\end{bmatrix}.
(3.128)
After that, the i-th component of the trace term in Eq. 6.72 is calculated as

\mathrm{Tr}\!\left[w\hat{G}\left(\frac{\partial A}{\partial w}\right)_i\right]
= w_i\hat{g}_i\sum_{j=1}^{J}\frac{\partial A_{i,i}}{\partial w_{i,j}}.
(3.129)
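To make the block structure of Eqs. 3.126–3.129 concrete, here is a small NumPy sketch, not part of the original method, that builds the matrices for toy sizes and checks the rank-one structure of Eq. 3.128. All names (w_ch, g_hat, dAii_dwi, and so on) are illustrative placeholders, and w_1, ..., w_I and ĝ_1, ..., ĝ_I are treated as per-channel scalars purely for illustration.

import numpy as np

I, J = 4, 3                      # toy number of output channels and weights per filter
i = 2                            # channel index considered in Eq. 3.126 (0-based here)

w_ch = np.random.randn(I)        # stand-ins for w_1, ..., w_I
g_hat = np.random.randn(I)       # stand-ins for \hat{g}_1, ..., \hat{g}_I
dAii_dwi = np.random.randn(J)    # stand-ins for ∂A_{i,i}/∂w_{i,1}, ..., ∂A_{i,i}/∂w_{i,J}

# Eq. 3.126: (∂A/∂w)_i is zero everywhere except its i-th row.
dA_dw_i = np.zeros((I, J))
dA_dw_i[i, :] = dAii_dwi

# Eq. 3.127: w \hat{G}(w, A) has entry (m, n) = w_m * \hat{g}_n, i.e. an outer product.
wG = np.outer(w_ch, g_hat)

# Eq. 3.128: the product keeps only column i of wG and row i of (∂A/∂w)_i, so
# entry (m, j) equals w_m * \hat{g}_i * ∂A_{i,i}/∂w_{i,j} and the matrix is rank one.
prod = wG @ dA_dw_i
assert np.allclose(prod, np.outer(w_ch * g_hat[i], dAii_dwi))

# Eq. 3.129: the i-th trace component, w_i * \hat{g}_i * sum_j ∂A_{i,i}/∂w_{i,j}.
trace_i = w_ch[i] * g_hat[i] * dAii_dwi.sum()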
Combining Eq. 6.72 and Eq. 3.129, we obtain

\hat{w}^{t+1} = w^{t+1} - \eta_2\lambda
\begin{bmatrix}
\hat{g}^{t}_{1}\sum_{j=1}^{J}\frac{\partial A^{t}_{1,1}}{\partial w^{t}_{1,j}} \\
\vdots \\
\hat{g}^{t}_{I}\sum_{j=1}^{J}\frac{\partial A^{t}_{I,I}}{\partial w^{t}_{I,j}}
\end{bmatrix}
\circledast
\begin{bmatrix}
w^{t}_{1} \\
\vdots \\
w^{t}_{I}
\end{bmatrix}
= w^{t+1} + \eta_2\lambda\, d^{t} \circledast w^{t},
(3.130)
where η_2 is the learning rate of the real-valued weight filters w_i, and ⊛ denotes the Hadamard product. We take d^t = -\left[\hat{g}^{t}_{1}\sum_{j=1}^{J}\partial A^{t}_{1,1}/\partial w^{t}_{1,j}, \cdots, \hat{g}^{t}_{I}\sum_{j=1}^{J}\partial A^{t}_{I,I}/\partial w^{t}_{I,j}\right]^{T}, which is unsolvable and undefined in the backpropagation of BNNs.
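If these per-channel sums were available, however, the update in Eq. 3.130 would reduce to a simple channel-wise rescaling of the filters. Below is a minimal NumPy sketch under that assumption, where sum_dA is a hypothetical stand-in for Σ_j ∂A^t_{i,i}/∂w^t_{i,j} and every shape and constant is illustrative rather than prescribed by the text.

import numpy as np

I, J = 4, 3                          # toy number of channels and weights per filter
eta2, lam = 1e-3, 1e-4               # assumed learning rate η_2 and balance weight λ

w_t = np.random.randn(I, J)          # real-valued filters w^t (one row per output channel)
w_tp1 = np.random.randn(I, J)        # w^{t+1} after the ordinary gradient step
g_hat_t = np.random.randn(I)         # \hat{g}^t_1, ..., \hat{g}^t_I
sum_dA = np.random.randn(I)          # hypothetical stand-in for sum_j ∂A^t_{i,i}/∂w^t_{i,j}

# d^t from Eq. 3.130: one entry per output channel, with the minus sign folded in.
d_t = -(g_hat_t * sum_dA)

# Channel-wise Hadamard product ⊛: each filter w^t_i is rescaled by d^t_i.
w_hat_tp1 = w_tp1 + eta2 * lam * d_t[:, None] * w_t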
To address this issue, we employ a recurrent model to approximate d^t, which gives

\hat{w}^{t+1} = w^{t+1} + U^{t} \circ \mathrm{DReLU}(w^{t}, A^{t}),
(3.131)

and

w^{t+1} \leftarrow \hat{w}^{t+1},
(3.132)
where we introduce a hidden layer with channel-wise learnable weights U \in \mathbb{R}^{C_{out}}_{+} to recurrently backtrack w. We present DReLU to supervise this optimization process and realize a controllable recurrent optimization. Channel-wise, we implement DReLU as

\mathrm{DReLU}(w_i, A_i) =
\begin{cases}
w_i & \text{if } (\neg D(w'_i)) \wedge D(A_i) = 1, \\
0 & \text{otherwise},
\end{cases}
(3.133)
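The recurrent backtracking of Eqs. 3.131–3.133 can be sketched channel-wise as below. The gate D(·) and the variable w'_i are defined elsewhere in the text, so a placeholder predicate stands in for D here; its threshold, the array layouts, and the function names are all assumptions made only for this sketch.

import numpy as np

def D(x, thresh=1e-2):
    # Placeholder for the gate D(.) of Eq. 3.133; the actual definition is given
    # elsewhere in the text. Here it simply checks whether the values are small.
    return bool(np.abs(x).mean() < thresh)

def drelu(w_i, A_ii, w_prime_i):
    # Eq. 3.133, channel-wise: pass w_i through only when (¬D(w'_i)) ∧ D(A_i) = 1.
    if (not D(w_prime_i)) and D(A_ii):
        return w_i
    return np.zeros_like(w_i)

def recurrent_backtrack(w_tp1, w_t, A_diag_t, w_prime_t, U_t):
    # Eqs. 3.131-3.132: \hat{w}^{t+1} = w^{t+1} + U^t ∘ DReLU(w^t, A^t), applied per
    # output channel, followed by w^{t+1} <- \hat{w}^{t+1}. A_diag_t holds the
    # diagonal entries A^t_{i,i}, and U_t the channel-wise learnable weights U.
    w_hat = w_tp1.copy()
    for i in range(w_t.shape[0]):
        w_hat[i] += U_t[i] * drelu(w_t[i], A_diag_t[i], w_prime_t[i])
    return w_hat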